[linux-nvidia-6.17] Use architecture specific HBM training status register #331

Open
ankita-nv wants to merge 1 commit into NVIDIA:24.04_linux-nvidia-6.17-next from ankita-nv:24.04_linux-nvidia-6.17-next-probe-fix-2502

@ankita-nv

Blackwell-Next GPUs use a different BAR0 offset (0xAD00BC) for the HBM
training status register than GB200 (0x200BC). Add runtime detection by
reading the architecture field from PMC BOOT_42 and selecting the
appropriate offset when polling for device readiness.

Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
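
For readers following the description above, here is a minimal sketch of the offset selection it describes, not the code from the patch. The two BAR0 offsets are the ones quoted in the description. The position of the architecture field (bits 28:24 of BOOT_42) and the `bw_next_arch` parameter are assumptions made for illustration; the real field layout and the architecture ID for Blackwell-Next are not stated in this conversation.

```c
#include <stdint.h>

/* BAR0 offsets quoted in the PR description. */
#define HBM_TRAINING_STATUS_GB200    0x200BCu
#define HBM_TRAINING_STATUS_BW_NEXT  0xAD00BCu

/* Assumption: the architecture field sits in bits 28:24 of PMC BOOT_42. */
#define BOOT_42_ARCH_SHIFT  24
#define BOOT_42_ARCH_MASK   0x1Fu

/* Extract the (assumed) architecture field from a raw BOOT_42 value. */
static inline uint32_t boot42_arch(uint32_t boot42)
{
    return (boot42 >> BOOT_42_ARCH_SHIFT) & BOOT_42_ARCH_MASK;
}

/*
 * Pick the HBM training status offset for the detected architecture.
 * bw_next_arch is a hypothetical architecture ID supplied by the caller;
 * every other architecture falls back to the GB200 offset.
 */
static inline uint32_t hbm_status_offset(uint32_t boot42, uint32_t bw_next_arch)
{
    return boot42_arch(boot42) == bw_next_arch ? HBM_TRAINING_STATUS_BW_NEXT
                                               : HBM_TRAINING_STATUS_GB200;
}
```

Passing the architecture ID as a parameter keeps the sketch from asserting a value that the description does not give.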

@nvmochs nvmochs changed the title Use architecture specific HBM training status register [linux-nvidia-6.17] Use architecture specific HBM training status register Feb 25, 2026
@nvmochs nvmochs self-requested a review February 25, 2026 16:54
Collaborator

@nvmochs nvmochs left a comment

This looks good to me.

Acked-by: Matthew R. Ochs <mochs@nvidia.com>

@nvmochs
Collaborator

nvmochs commented Feb 25, 2026

@ankita-nv Are there plans to upstream this patch?

@nirmoy
Collaborator

nirmoy commented Feb 25, 2026

Acked-by: Nirmoy Das <nirmoyd@nvidia.com>

@ankita-nv
Author

> @ankita-nv Are there plans to upstream this patch?

Yeah, I'll post it shortly after internal review.

@nvmochs
Collaborator

nvmochs commented Feb 26, 2026

Ankit requested that we hold on getting this integrated.

@ankita-nv ankita-nv force-pushed the 24.04_linux-nvidia-6.17-next-probe-fix-2502 branch 2 times, most recently from 58fc644 to 4b04466 Compare March 14, 2026 06:15
…diness check

Blackwell-Next GPUs report device readiness via the CXL DVSEC Range 1 Low
register (offset 0x1C) instead of the BAR0 HBM training register used by
GB200. GPU memory readiness is checked by polling the Memory_Active bit
(bit 1) for up to the period encoded in Memory_Active_Timeout (bits 15:13).
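
As an illustration of the register fields named in this paragraph, the following sketch decodes a raw 32-bit value read from offset 0x1C. The bit positions come from the commit message. The mapping of the timeout encoding to seconds (0 to 4 meaning 1, 4, 16, 64 and 256 seconds) is an assumption based on my reading of the CXL specification and should be checked against it; the helper names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define DVSEC_RANGE1_LOW_OFFSET       0x1Cu      /* from the commit message */
#define RANGE1_MEMORY_ACTIVE          (1u << 1)  /* bit 1 */
#define RANGE1_ACTIVE_TIMEOUT_SHIFT   13
#define RANGE1_ACTIVE_TIMEOUT_MASK    0x7u       /* bits 15:13 */

/* True when the Memory_Active bit is set in the raw register value. */
static inline bool range1_memory_active(uint32_t reg)
{
    return (reg & RANGE1_MEMORY_ACTIVE) != 0;
}

/* Raw 3-bit Memory_Active_Timeout encoding. */
static inline uint32_t range1_active_timeout_field(uint32_t reg)
{
    return (reg >> RANGE1_ACTIVE_TIMEOUT_SHIFT) & RANGE1_ACTIVE_TIMEOUT_MASK;
}

/*
 * Assumed encoding: 0..4 map to 1, 4, 16, 64, 256 seconds (4^n).
 * Other encodings are treated as reserved and return 0.
 */
static inline uint32_t range1_active_timeout_secs(uint32_t reg)
{
    uint32_t enc = range1_active_timeout_field(reg);

    return enc <= 4 ? 1u << (2 * enc) : 0;
}
```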

Add runtime detection by checking for the presence of the DVSEC register.
Use the new method when it is present; otherwise continue using the legacy
approach.

Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
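
The routing described in this commit message could be sketched as follows. This is a userspace model under stated assumptions, not the patch itself: `struct ready_ops`, `wait_device_ready()` and the `fake_*` helpers are hypothetical names, and the register accesses are hidden behind callbacks so the example runs without hardware. In the kernel, the `has_dvsec` flag could plausibly come from a non-zero return of `pci_find_dvsec_capability()`, though whether the patch does that is not shown here.

```c
#include <stdbool.h>
#include <stddef.h>

typedef bool (*ready_check_fn)(void *ctx);

/* Hypothetical hooks standing in for the real register accesses. */
struct ready_ops {
    ready_check_fn dvsec_memory_active; /* new path: Memory_Active set? */
    ready_check_fn legacy_hbm_ready;    /* legacy path: BAR0 HBM training done? */
    void (*delay_ms)(void *ctx, unsigned int ms); /* may be NULL */
};

/* Poll the selected readiness source; 0 on ready, -1 on timeout. */
static int wait_device_ready(const struct ready_ops *ops, void *ctx,
                             bool has_dvsec, unsigned int timeout_ms,
                             unsigned int poll_ms)
{
    ready_check_fn check = has_dvsec ? ops->dvsec_memory_active
                                     : ops->legacy_hbm_ready;
    unsigned int waited = 0;

    if (poll_ms == 0)
        poll_ms = 1; /* avoid spinning forever on a zero interval */

    for (;;) {
        if (check(ctx))
            return 0;
        if (waited >= timeout_ms)
            return -1;
        if (ops->delay_ms)
            ops->delay_ms(ctx, poll_ms);
        waited += poll_ms;
    }
}

/* Fake device used only to exercise the routing logic. */
struct fake_dev {
    unsigned int dvsec_polls;  /* calls made on the DVSEC path */
    unsigned int legacy_polls; /* calls made on the legacy path */
    unsigned int ready_after;  /* number of polls before reporting ready */
};

static bool fake_dvsec_active(void *ctx)
{
    struct fake_dev *d = ctx;

    return ++d->dvsec_polls >= d->ready_after;
}

static bool fake_legacy_ready(void *ctx)
{
    struct fake_dev *d = ctx;

    return ++d->legacy_polls >= d->ready_after;
}

static const struct ready_ops fake_ops = {
    .dvsec_memory_active = fake_dvsec_active,
    .legacy_hbm_ready    = fake_legacy_ready,
    .delay_ms            = NULL,
};
```

With the fake device, calling `wait_device_ready(&fake_ops, &dev, true, ...)` only ever touches the DVSEC callback, and passing `false` only touches the legacy one, which is the behaviour the commit message describes.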
@ankita-nv ankita-nv force-pushed the 24.04_linux-nvidia-6.17-next-probe-fix-2502 branch from 4b04466 to f400624 Compare March 14, 2026 06:26

3 participants